Conversation

@dkimds dkimds commented Jul 1, 2025

Summary

Fixes #134 by adding debug mode support to benchmarks that were missing it.

Changes

  • ✅ AIME24: Added debug slicing [:5]
  • ✅ AIME25: Added debug slicing [:5]
  • ✅ AIW: Added debug slicing [:5]
  • ✅ AMC23: Added debug slicing [:5]
  • ✅ HMMT: Added debug slicing [:5]
  • ✅ MATH500: Added debug slicing [:5]

Testing

# Before: these tasks either failed or ignored the --debug flag
python -m eval.eval --model hf --tasks AIME24 --debug --model_args "pretrained=microsoft/DialoGPT-medium"

# After: all six tasks run with at most 5 examples
python -m eval.eval --model hf --tasks AIME24,AIME25,AIW,AMC23,HMMT,MATH500 --debug --model_args "pretrained=microsoft/DialoGPT-medium"

Implementation Pattern

Following the established pattern from MTBench and other working benchmarks:

if self.debug:
    examples = examples[:5]
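
For context, here is a minimal sketch of how that pattern typically sits inside a benchmark's data-loading step. The class and method names below (MathBenchmark, load_examples) and the JSONL format are illustrative placeholders, not the repository's actual API:

import json

class MathBenchmark:
    def __init__(self, data_path: str, debug: bool = False):
        # debug is typically wired up from the --debug CLI flag
        self.data_path = data_path
        self.debug = debug

    def load_examples(self) -> list:
        # Read one JSON object per line (JSONL assumed here for illustration)
        with open(self.data_path) as f:
            examples = [json.loads(line) for line in f]

        # The change in this PR: cap the dataset at 5 examples in debug mode
        if self.debug:
            examples = examples[:5]
        return examples

With the cap in place, each of the six benchmarks processes only a handful of items in debug mode, which is what makes the quick smoke-test commands above practical.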

Impact

✅ Consistent debug behavior across all benchmarks
✅ Faster development iteration (5 examples vs full dataset)
✅ Reduced compute costs during testing
✅ No breaking changes to existing functionality

Benchmarks affected: AIME24, AIME25, AIW, AMC23, HMMT, MATH500
@dkimds dkimds (Author) commented Jul 4, 2025

Hi @neginraoof, I noticed you recently reviewed a merged PR. Would you be able to take a look at my PR as well when you have some time? I’d really appreciate your feedback. Thanks!

@dkimds dkimds changed the title from "Add debug mode to 5 benchmarks" to "Add debug mode support to 6 benchmarks (AIME24, AIME25, AIW, AMC23, HMMT, MATH500)" on Jul 8, 2025
Successfully merging this pull request may close issue #134: Debug mode fails for AIME24 and 5 other benchmarks